Max Piazza
David Orona
Nithya Arumugam
Abhitej Bokka

Data Analysis Project: Modeling Used Car Prices

Introduction

In an ever-volatile market where every dollar counts, the used car market represents a critical sector of the consumer industry. With rising consumer demand and an increasing variety of vehicles entering the secondary market, understanding the factors that influence used car prices is essential. Buyers seek to make informed decisions based on value for money, while sellers aim to maximize returns by accurately pricing their vehicles. Bridging this gap requires a data-driven approach to uncover the relationships between vehicle specifications and market pricing.

This project utilizes the “Vehicle dataset” by Nehal Birla, Nishant Verma, and Nikhil Kushwaha, available on Kaggle at https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho/data?select=Car+details+v3.csv.

This dataset aggregates detailed information on over 10,000 used cars, including key variables such as fuel type, transmission, engine capacity, mileage, and kilometers driven, alongside categorical variables like seller type, ownership history, and geographic location.

Our analysis is driven by three core objectives:

  • To model the relationship between vehicle specifications (e.g., fuel efficiency, transmission, and fuel type) and their pricing.
  • To identify regional trends and seller-specific factors influencing market prices.
  • To evaluate how performance metrics, such as engine power and fuel efficiency, impact purchasing behavior.

Using statistical modeling techniques, including regression analysis, we aim to deliver a robust and interpretable model that not only predicts car prices but also highlights the most influential factors driving price variations. Our results will shed light on market dynamics, offering actionable insights for both consumers and industry professionals navigating this volatile space.

By the conclusion of this project, we aim to provide a detailed analysis that enhances understanding of the used car market, aiding stakeholders in making informed decisions in an ever-changing economic landscape.

The dataset contains the following attributes:

Attribute Name Description
1 name The make and model of the vehicle (e.g., Hyundai i10, Honda City).
2 year The year the vehicle was manufactured.
3 selling_price The selling price of the vehicle in Indian Rupees (INR).
4 km_driven The total distance the vehicle has been driven (in km).
5 fuel The type of fuel used by the vehicle (“Diesel”, “Petrol”, “LPG”, or “CNG”).
6 seller_type The type of seller (“Individual”, “Dealer” or “Trustmark Dealer”).
7 transmission The type of transmission system (“Manual” or “Automatic”).
8 owner The number of previous owners of the vehicle (“First Owner”, “Second Owner”, “Third Owner”, “Fourth and Above Owner”, or “Test Drive Car”).
9 mileage The fuel efficiency of the vehicle (in either km/l or km/kg).
10 engine The engine displacement capacity (in CC).
11 max_power The maximum power output of the vehicle’s engine, measured in brake horsepower (bhp).
12 torque A pair of torque and RPM values representing the max torque of the vehicle. The torque values are in either Nm or kgm, and the RPM values are either values or value ranges.
13 seats The seating capacity of the vehicle as an integer.

The above attributes can be categorized into numeric and categorical variables. Numeric variables can be separated into discrete and continuous variables.

Attribute Type Attribute Name
Categorical Variables name, fuel, seller_type, transmission, owner
Discrete Numeric Variables year, km_driven, seats
Continuous Numeric Variables selling_price, mileage, engine, max_power, torque

Methods

# Set document-wide options 
options(tibble.width = Inf)  # Make tibbles display all columns

Load dataset

car_details = read_csv("Car details v3.csv")

Total number of observations

nrow(car_details)
## [1] 8128

Sample of data from Vehicle dataset

head(car_details)
## # A tibble: 6 × 13
##   name                           year selling_price km_driven fuel   seller_type
##   <chr>                         <dbl>         <dbl>     <dbl> <chr>  <chr>      
## 1 Maruti Swift Dzire VDI         2014        450000    145500 Diesel Individual 
## 2 Skoda Rapid 1.5 TDI Ambition   2014        370000    120000 Diesel Individual 
## 3 Honda City 2017-2020 EXi       2006        158000    140000 Petrol Individual 
## 4 Hyundai i20 Sportz Diesel      2010        225000    127000 Diesel Individual 
## 5 Maruti Swift VXI BSIII         2007        130000    120000 Petrol Individual 
## 6 Hyundai Xcent 1.2 VTVT E Plus  2017        440000     45000 Petrol Individual 
##   transmission owner        mileage    engine  max_power 
##   <chr>        <chr>        <chr>      <chr>   <chr>     
## 1 Manual       First Owner  23.4 kmpl  1248 CC 74 bhp    
## 2 Manual       Second Owner 21.14 kmpl 1498 CC 103.52 bhp
## 3 Manual       Third Owner  17.7 kmpl  1497 CC 78 bhp    
## 4 Manual       First Owner  23.0 kmpl  1396 CC 90 bhp    
## 5 Manual       First Owner  16.1 kmpl  1298 CC 88.2 bhp  
## 6 Manual       First Owner  20.14 kmpl 1197 CC 81.86 bhp 
##   torque                   seats
##   <chr>                    <dbl>
## 1 190Nm@ 2000rpm               5
## 2 250Nm@ 1500-2500rpm          5
## 3 12.7@ 2,700(kgm@ rpm)        5
## 4 22.4 kgm at 1750-2750rpm     5
## 5 11.5@ 4,500(kgm@ rpm)        5
## 6 113.75nm@ 4000rpm            5

Data cleaning

Excluding missing values from dataset

car_details = na.omit(car_details)
nrow(car_details)
## [1] 7906

Excluding duplicate rows from dataset

sum(duplicated(car_details))
## [1] 1189
car_details = car_details[!duplicated(car_details),]
nrow(car_details)
## [1] 6717

Transforming name variable

# Extract the first word from "name" to get the make of the vehicle
car_details$name = word(car_details$name,1)
# Rename "name" variable to "make"
colnames(car_details)[1] = "make"
# Change datatypes of "make" from character to factor
car_details$make = as.factor(car_details$make)

Transforming mileage variable

# View raw data for "mileage" variable
head(car_details$mileage, n=20)
##  [1] "23.4 kmpl"  "21.14 kmpl" "17.7 kmpl"  "23.0 kmpl"  "16.1 kmpl" 
##  [6] "20.14 kmpl" "17.3 km/kg" "16.1 kmpl"  "23.59 kmpl" "20.0 kmpl" 
## [11] "19.01 kmpl" "17.3 kmpl"  "19.3 kmpl"  "18.9 kmpl"  "18.15 kmpl"
## [16] "24.52 kmpl" "23.0 kmpl"  "19.7 kmpl"  "22.54 kmpl" "21.0 kmpl"
# Drop 86 rows where "mileage" value contains the text "km/kg"
car_details <- car_details[!grepl("km/kg", car_details$mileage), ]
# Extract the numeric value from "mileage" 
car_details$mileage = word(car_details$mileage,1)
# Change data_type of "mileage" from character to numeric
car_details$mileage=as.numeric(car_details$mileage)

# Rename "mileage" to "fuel_efficiency"
colnames(car_details)[which(names(car_details) == "mileage")] = "fuel_efficiency"

# View new data for "fuel_efficiency" variable
head(car_details$fuel_efficiency, n=20)
##  [1] 23.40 21.14 17.70 23.00 16.10 20.14 16.10 23.59 20.00 19.01 17.30 19.30
## [13] 18.90 18.15 24.52 23.00 19.70 22.54 21.00 25.50

Above, the mileage variable is renamed to fuel_efficiency to prevent confusion with the km_driven variable, which represents the quantity normally referred to as vehicle “mileage”.

Transforming engine variable

# View raw data for "engine" variable
head(car_details$engine, n=20)
##  [1] "1248 CC" "1498 CC" "1497 CC" "1396 CC" "1298 CC" "1197 CC" "796 CC" 
##  [8] "1364 CC" "1399 CC" "1461 CC" "993 CC"  "1248 CC" "1061 CC" "1198 CC"
## [15] "1248 CC" "1396 CC" "796 CC"  "1396 CC" "1461 CC" "1498 CC"
# Extract the numeric value from each entry (remove the text "CC" from the end of each entry)
car_details$engine = word(car_details$engine,1)
# Change data_type of "engine" from character to numeric
car_details$engine=as.numeric(car_details$engine)
# View new data for "engine" variable
head(car_details$engine, n=20)
##  [1] 1248 1498 1497 1396 1298 1197  796 1364 1399 1461  993 1248 1061 1198 1248
## [16] 1396  796 1396 1461 1498

Transforming max_power variable

# Transform "max_power" variable
# View raw data for "max_power" variable
head(car_details$max_power, n=20)
##  [1] "74 bhp"     "103.52 bhp" "78 bhp"     "90 bhp"     "88.2 bhp"  
##  [6] "81.86 bhp"  "37 bhp"     "67.1 bhp"   "68.1 bhp"   "108.45 bhp"
## [11] "60 bhp"     "73.9 bhp"   "67 bhp"     "82 bhp"     "88.5 bhp"  
## [16] "90 bhp"     "46.3 bhp"   "88.73 bhp"  "64.1 bhp"   "98.6 bhp"
# Extract the numeric value from each entry of "max_power" (remove the text "bhp" from the end of each entry)
car_details$max_power = word(car_details$max_power,1)
# Change data_type of "max_power" from character to numeric
car_details$max_power=as.numeric(car_details$max_power)
# View new data for "max_power" variable
head(car_details$max_power, n=20)
##  [1]  74.00 103.52  78.00  90.00  88.20  81.86  37.00  67.10  68.10 108.45
## [11]  60.00  73.90  67.00  82.00  88.50  90.00  46.30  88.73  64.10  98.60

Converting character-typed columns to factor type

# Change datatypes of "fuel", "seller_type", "transmission", and "owner"
# from character to factor 
car_details$fuel=as.factor(car_details$fuel)
car_details$seller_type=as.factor(car_details$seller_type)
car_details$transmission=as.factor(car_details$transmission)
car_details$owner=as.factor(car_details$owner)

Converting km_driven to km_driven_in_10k

# Modify the "km_driven" column by dividing the values of each entry by 10,000 and renaming the column to "km_driven_in_10k" 
car_details$km_driven = car_details$km_driven/10000
# Rename km_driven
colnames(car_details)[4]="km_driven_in_10k"
# View data for "km_driven_in_10k" column
head(car_details)[4]
## # A tibble: 6 × 1
##   km_driven_in_10k
##              <dbl>
## 1             14.6
## 2             12  
## 3             14  
## 4             12.7
## 5             12  
## 6              4.5

Converting selling_price to selling_price_in_10k

# Modify "selling_price" column by dividing the values of each entry by 10,000 and renaming the column to "selling_price_in_10k"
car_details$selling_price = car_details$selling_price/10000
# Rename selling_price to "selling_price_in_10k"
colnames(car_details)[3]="selling_price_in_10k"
# View data for "selling_price_in_10k" column
head(car_details)[3]
## # A tibble: 6 × 1
##   selling_price_in_10k
##                  <dbl>
## 1                 45  
## 2                 37  
## 3                 15.8
## 4                 22.5
## 5                 13  
## 6                 44

Dropping torque column

# Drop "torque" column
car_details = car_details[,!names(car_details) %in% "torque"]

Creating make_category variable from make variable

unique(car_details$make)
##  [1] Maruti        Skoda         Honda         Hyundai       Toyota       
##  [6] Ford          Renault       Mahindra      Tata          Chevrolet    
## [11] Datsun        Jeep          Mercedes-Benz Mitsubishi    Audi         
## [16] Volkswagen    BMW           Nissan        Lexus         Jaguar       
## [21] Land          MG            Volvo         Daewoo        Kia          
## [26] Fiat          Force         Ambassador    Ashok         Isuzu        
## [31] Opel         
## 31 Levels: Ambassador Ashok Audi BMW Chevrolet Daewoo Datsun Fiat ... Volvo
  • From the above result we can see that there are 31 unique values for “make”.
  • When we model the data using the independent variable “make”, there will be at least 30 dummy variables as predictors.
  • To reduce the model complexity and to increase interpretability, the car make can be grouped into broader categories as “Budget”, “Mid-Range” or “Luxury” depending on general market perception.
# Create new column "make_category" from "make" column
car_details$make_category = ifelse(car_details$make %in% 
  c("Ambassador", "Ashok", "Daewoo", "Datsun", "Opel", "Fiat"), 
  "Budget",
  ifelse(car_details$make %in% 
           c("Chevrolet", "Maruti", "Renault", "Mitsubishi", "Ford", 
             "Honda", "Hyundai", "Isuzu", "Kia", "Toyota", "Force", 
             "Volkswagen", "Tata", "Skoda", "Jeep", "MG", "Nissan", "Mahindra"), 
         "Midrange", 
         "Luxury"
  )
)

# Convert make_category to a factor
car_details$make_category = as.factor(car_details$make_category)

# Output all levels of the make_category variable
levels(car_details$make_category)
## [1] "Budget"   "Luxury"   "Midrange"
# Create a summary table of makes and their categories and view the mapping
make_category_mapping <- unique(car_details[, c("make", "make_category")])
make_category_mapping_df <- as.data.frame(make_category_mapping)
print(make_category_mapping_df, row.names = FALSE)  
##           make make_category
##         Maruti      Midrange
##          Skoda      Midrange
##          Honda      Midrange
##        Hyundai      Midrange
##         Toyota      Midrange
##           Ford      Midrange
##        Renault      Midrange
##       Mahindra      Midrange
##           Tata      Midrange
##      Chevrolet      Midrange
##         Datsun        Budget
##           Jeep      Midrange
##  Mercedes-Benz        Luxury
##     Mitsubishi      Midrange
##           Audi        Luxury
##     Volkswagen      Midrange
##            BMW        Luxury
##         Nissan      Midrange
##          Lexus        Luxury
##         Jaguar        Luxury
##           Land        Luxury
##             MG      Midrange
##          Volvo        Luxury
##         Daewoo        Budget
##            Kia      Midrange
##           Fiat        Budget
##          Force      Midrange
##     Ambassador        Budget
##          Ashok        Budget
##          Isuzu      Midrange
##           Opel        Budget

Sample of dataset after cleaning

head(car_details)
## # A tibble: 6 × 13
##   make     year selling_price_in_10k km_driven_in_10k fuel   seller_type
##   <fct>   <dbl>                <dbl>            <dbl> <fct>  <fct>      
## 1 Maruti   2014                 45               14.6 Diesel Individual 
## 2 Skoda    2014                 37               12   Diesel Individual 
## 3 Honda    2006                 15.8             14   Petrol Individual 
## 4 Hyundai  2010                 22.5             12.7 Diesel Individual 
## 5 Maruti   2007                 13               12   Petrol Individual 
## 6 Hyundai  2017                 44                4.5 Petrol Individual 
##   transmission owner        fuel_efficiency engine max_power seats make_category
##   <fct>        <fct>                  <dbl>  <dbl>     <dbl> <dbl> <fct>        
## 1 Manual       First Owner             23.4   1248      74       5 Midrange     
## 2 Manual       Second Owner            21.1   1498     104.      5 Midrange     
## 3 Manual       Third Owner             17.7   1497      78       5 Midrange     
## 4 Manual       First Owner             23     1396      90       5 Midrange     
## 5 Manual       First Owner             16.1   1298      88.2     5 Midrange     
## 6 Manual       First Owner             20.1   1197      81.9     5 Midrange

Final structure of the car_details dataset

structure_car_details = data.frame(
  Column = names(car_details),
  Type = unname(sapply(car_details, class)))
print(structure_car_details)
##                  Column    Type
## 1                  make  factor
## 2                  year numeric
## 3  selling_price_in_10k numeric
## 4      km_driven_in_10k numeric
## 5                  fuel  factor
## 6           seller_type  factor
## 7          transmission  factor
## 8                 owner  factor
## 9       fuel_efficiency numeric
## 10               engine numeric
## 11            max_power numeric
## 12                seats numeric
## 13        make_category  factor

Data Analysis

Variable Distribution Analysis

car_details$selling_price_in_10k[car_details$selling_price_in_10k > 720]
## [1] 1000
car_details=subset(car_details,subset = car_details$selling_price_in_10k < 720,)
  • There is one observation, which is quite different from the general pattern of selling price. This can impact the model that we build, hence excluded one observation with selling_price = 1000.
Selling Price distribution
# Histogram of selling prices 
hist(car_details$selling_price_in_10k,xlab="Selling Price (in 10,000 INR)",
     main="Selling Price distribution",
     breaks = 30, 
     col = "lightblue")

  • The above plot is positively skewed, meaning the selling prices for most of the observations are less than or equal to 2 million INR, and there are much fewer observations with selling prices above this amount.
Frequency Distribution of Categorical Variables
make_counts = table(car_details$make)
fuel_type_car_count = table(car_details$fuel)
seller_type_car_count = table(car_details$seller_type)
trans_type_car_count = table(car_details$transmission)
owner_type_car_count = table(car_details$owner)

par(mfrow = c(2, 3))
# Plot 1: Number of cars by make
barplot(sort(make_counts),horiz=TRUE, las = 1,
        xlab = "Number of Cars",
        ylab = "Car Make",
        col = "lightblue",
        cex.names = 0.5,
        main = "Num. of Cars in Each Make")

# Plot 2: Number of cars by fuel type
barplot(sort(fuel_type_car_count), horiz = TRUE, las = 1, cex.names = 0.9,
        xlab = "Number of Cars",col = "lightblue", main = "Num. of Cars in Each Fuel Type")

# Plot 3: Number of cars by seller type
barplot(sort(seller_type_car_count), horiz = TRUE, las = 1, cex.names = 0.8,
        xlab = "Number of Cars",col = "lightblue", main = "Num. of Cars in Each Seller Type")

# Plot 4: Number of cars by transmission type
barplot(sort(trans_type_car_count), horiz = TRUE, las = 1, cex.names = 0.8,
        xlab = "Number of Cars",col = "lightblue", main = "Num. of Cars in Each Trans Type")

# Plot 5: Number of cars by owner type
barplot(sort(owner_type_car_count), horiz = TRUE, las = 1, cex.names = 0.6,
        xlab = "Number of Cars",col = "lightblue", main = "Num. of Cars in Each Owner Type")

Boxplots of Selling Price by Different Categorical Variables
Selling Price by Car Make
# Boxplot of selling price by car make
boxplot(selling_price_in_10k ~ make, data = car_details,col=rainbow(length(unique(car_details$make))),
        las = 2,cex.axis = 0.7,               
        main = "Boxplot of Selling Price by Car Make",
        xlab = "Car Make",
        ylab = "Selling Price (10,000 INR)")

  • The above boxplots indicates a significant variation in selling prices across different car makes. Some brands have much wider price ranges than others.
  • Brands like Mercedes-Benz, BMW, Jaguar, Land Rover and Volvo have the highest median selling prices, while cars like Tata, Maruti, and Daewoo have the lowest median selling prices.
Selling Price by Fuel Type
# Boxplot of selling price by fuel type
boxplot(selling_price_in_10k ~ fuel, data = car_details, cex.axis = 0.8,
        col = rainbow(length(unique(car_details$fuel))),
        main = "Box Plot of Selling Price by Fuel Type",
        xlab = "Fuel Type", ylab = "Selling Price (10,000 INR)")

  • From the above boxplot we can see that the median selling price of Diesel cars are slightly higher than the median selling price of Petrol cars.
Selling Price by Seller Type
# Boxplot of selling price by seller type
boxplot(selling_price_in_10k ~ seller_type, data = car_details, cex.axis = 0.6,
        col = rainbow(length(unique(car_details$seller_type))),
        main = "Box Plot of Selling Price by Seller Type",
        xlab = "Seller Type", ylab = "Selling Price (10,000 INR)")

  • The cars sold by Individuals have the lowest median cost when compared to cars sold by dealers.
Selling Price by Transmission Type
# Boxplot of selling price by transmission type
plot(selling_price_in_10k ~ transmission, data = car_details,
     col = rainbow(length(unique(car_details$transmission))),
     main = "Boxplot of Selling Price by Transmission Type",
     xlab = "Transmission Type", 
     ylab = "Selling Price (10,000 INR)")

  • We can see that there is a difference in the selling price of the Automatic and Manual transmission cars. The cost range of Automatic transmission cars is higher than that of Manual transmission cars.
Selling Price by Owner Type
# Boxplot of selling price by owner type
plot(selling_price_in_10k ~ owner, data = car_details, las = 2, cex.axis = 0.5,
     col = rainbow(length(unique(car_details$owner))),
     main = "Boxplot of Selling Price vs Owner Type",
     xlab = "Owner Type", 
     ylab = "Selling Price (10,000 INR)")

  • The selling prices of cars across different owner types are significantly different.
  • The median price of test drive cars are very high and the rest of the owner types have low median cost
  • The median selling price is in the decreasing order of First Owner, Second Owner, Third Owner, Fourth & above Owner.

Scatter Plots of Selling Price vs Different Numerical Variables

# Scatter plot of selling price vs year and transmission type
colours = ifelse(car_details$transmission == "Automatic", "blue", "red")
plot(selling_price_in_10k ~ year,data=car_details,col=colours,pch=19,
     main = "Selling Price vs Year and Transmission Type",
     xlab = "Year",
     ylab = "Selling Price (10,000 INR)")
legend("topleft", legend = c("Automatic", "Manual"),
       col = c("blue", "red"), pch = 19)

  • We can see a positive correlation between year and selling price. As the year increases, the selling price is increasing.
  • This suggests that newer cars are priced higher than older ones.
  • We can also see that automatic transmission cars have a higher selling price across all years.
# Scatter plot of selling price vs year and make category
colours = c("Budget" = "blue", "Midrange"="green","Luxury"= "orange")
plot(selling_price_in_10k ~ year,data=car_details,
     col = colours[car_details$make_category],pch=19,
     main = "Selling Price vs Year and Make Category",
     xlab = "Year",
     ylab = "Selling Price (10,000 INR)")
legend("topleft", legend = c("Budget","Midrange","Luxury"),
       col = c("blue", "green","orange"), pch = 19)

  • Similar to previous plot, there is a positive correlation between year and selling price.
  • We see that cost of Budget cars, mid range cars and luxury car increases with Year.
# Scatter plot of selling price vs km driven
plot(selling_price_in_10k ~ km_driven_in_10k, data=car_details,
     xlab="Km Driven (10,000 km)",
     ylab="Selling Price (10,000 INR)",
     main="Selling Price vs Km Driven")

# Subset out extreme values
car_details = subset(car_details,car_details$km_driven_in_10k < 100,)

# Scatter plot of selling price vs km driven after removing extreme values
plot(selling_price_in_10k ~ km_driven_in_10k, data=car_details,
     xlab="Km Driven (10,000 km)",
     ylab="Selling Price (10,000 INR)",
     main="Selling Price vs Km Driven (extreme values removed)")

  • The relationship between selling price and kilometers driven doesn’t seem to be strongly linear.
  • But we can see that as km driven increases, the selling price remains in low range.
  • There are 2 observations which are different from the general pattern with values of km driven (150.0000 236.0457). This can been seen in the above plot.
  • These observations can impact the model. Hence those two data points are excluded from the second plot.
# Subset out 15 observations with fuel_efficiency=0 
car_details = subset(car_details, car_details$fuel_efficiency!=0,)

# Scatter plot of selling price vs fuel efficiency
plot(selling_price_in_10k ~ fuel_efficiency, data=car_details,
     xlab="Fuel Efficiency (km/l)",
     ylab="Selling Price (10,000 INR)",
     main="Selling Price vs Fuel Efficiency")

  • Most data points are clustered at fuel efficiency values of 10 to 30 km/l, and there doesn’t seem to be a linear relationship.
  • That is higher fuel efficiency doesn’t indicate higher selling price.

  • From the above plot, we can see that the selling price is high for higher values of engine power
# Scatter plot of selling price vs max power (by fuel type)
colours = c("Petrol" = "blue", "Diesel"="green")
plot(selling_price_in_10k ~ max_power, data=car_details,
     col=colours[car_details$fuel],
     main="Selling Price vs Max Power (by Fuel Type)",
     xlab="Max Power (bhp)",
     ylab="Selling Price (10,000 INR)" )
legend("topleft", legend = c("Petrol","Diesel"),
       col = c("blue", "green"), pch = 19)

  • From the above plot, we can see that the there is a linear relationship between max power and selling price.
  • We can conclude that when the maximum power increases, the selling price of the car increases
  • In general we saw that Diesel powered cars have higher selling price.
  • But from the above plot we see that Petrol powered cars with higher horse power have higher selling price.

Inference from Data Analysis

  • The selling price distribution is positively skewed. Positively skewed data has extreme values which makes it hard to fit models.
    • Hence, logarithmic transformation can make the selling price distribution to be normally distributed.
    • Logarithmic transformations can also help stabilize the variance, making the data more homoscedastic and suitable for analysis.
  • There is a positive correlation between year and selling price.
  • The selling price also tends to increase with engine displacement and max power.
  • Cost of Budget cars, mid range cars and luxury car increases with Year.
  • As the km_driven increases, the selling price tends to decrease.
  • There is no impact of fuel efficiency on selling price
  • The median cost of Automatic transmission cars are higher than Manual transmission cars.
  • The median price of test drive cars are very high when compared to other owner types.
  • Cost of Diesel cars are generally high. But cost of petrol cars with high horse power is also high

Analysis of Correlation Between Numeric variables

# Pairs plot for numeric variables
pairs(selling_price_in_10k ~ year+km_driven_in_10k+fuel_efficiency+engine+max_power+seats ,data=car_details)

# Calculate correlation matrix for numeric variables
cor_mat = cor(car_details[,sapply(car_details,is.numeric)])
cor_mat
##                             year selling_price_in_10k km_driven_in_10k
## year                  1.00000000            0.4395767      -0.45496390
## selling_price_in_10k  0.43957666            1.0000000      -0.19812586
## km_driven_in_10k     -0.45496390           -0.1981259       1.00000000
## fuel_efficiency       0.37352047           -0.1258892      -0.24186898
## engine               -0.02333154            0.4480476       0.30860425
## max_power             0.15993052            0.6857687       0.04886655
## seats                 0.02630866            0.1648269       0.24955889
##                      fuel_efficiency      engine   max_power       seats
## year                       0.3735205 -0.02333154  0.15993052  0.02630866
## selling_price_in_10k      -0.1258892  0.44804759  0.68576871  0.16482693
## km_driven_in_10k          -0.2418690  0.30860425  0.04886655  0.24955889
## fuel_efficiency            1.0000000 -0.59909460 -0.40078564 -0.48552947
## engine                    -0.5990946  1.00000000  0.68399138  0.66312580
## max_power                 -0.4007856  0.68399138  1.00000000  0.25911446
## seats                     -0.4855295  0.66312580  0.25911446  1.00000000
# Extract high correlation values 
high_cor = cor_mat[cor_mat > 0.5 & cor_mat != 1]
high_cor_indices = which(cor_mat > 0.5 & cor_mat != 1, arr.ind = TRUE)
high_cor_df = data.frame(
  row = rownames(cor_mat)[high_cor_indices[, 1]],
  column = colnames(cor_mat)[high_cor_indices[, 2]],
  correlation = high_cor)

# Display first few pairs of high-correlation variables
head(high_cor_df, 6)
##                    row               column correlation
## 1            max_power selling_price_in_10k   0.6857687
## 2            max_power               engine   0.6839914
## 3                seats               engine   0.6631258
## 4 selling_price_in_10k            max_power   0.6857687
## 5               engine            max_power   0.6839914
## 6               engine                seats   0.6631258
  • From the correlation plot and correlation matrix we can see that, following combination of fields have correlation above 0.5
  • (max_power and selling_price) has high correlation of 0.6872307. Hence selling price increase with max_power
  • (max_power and engine) has high correlation of 0.6863027. Hence max power and engine are correlated predictors

Model Development and Validation

In this section, a model for predicting values of selling_price based on the values of the other variables is developed and the performance of the model is analyzed.

Logarithmic transformation of response variable

Logarithmic transformation can make the positively skewed distribution of selling price to be normally distributed and makes it more suitable for model building

car_details$selling_price_in_10k = log(car_details$selling_price_in_10k)

Splitting data into train and test sets

set.seed(125) # For reproducibility
train_indices = sample(nrow(car_details), size = 0.80 * nrow(car_details))
train_data = car_details[train_indices, ]
test_data = car_details[-train_indices, ]
nrow(train_data)
## [1] 5289
nrow(test_data)
## [1] 1323

Additive “full” linear model (using all available predictors except make)

# Definition of "full" linear model with BIC-based stepwise model selection starting from full model (both directions)
n= nrow(train_data)
bic_full_add_model = step(lm(selling_price_in_10k ~ .-make, data = train_data),direction="both",k=log(n),trace=0)

# Model summary
summary(bic_full_add_model)
## 
## Call:
## lm(formula = selling_price_in_10k ~ (make + year + km_driven_in_10k + 
##     fuel + seller_type + transmission + owner + fuel_efficiency + 
##     engine + max_power + seats + make_category) - make, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.68520 -0.16717  0.02508  0.19042  1.42921 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -2.217e+02  3.054e+00 -72.603  < 2e-16 ***
## year                         1.112e-01  1.530e-03  72.642  < 2e-16 ***
## km_driven_in_10k            -3.850e-03  1.033e-03  -3.728 0.000195 ***
## fuelPetrol                  -1.369e-01  1.296e-02 -10.569  < 2e-16 ***
## seller_typeIndividual       -7.010e-02  1.401e-02  -5.003 5.82e-07 ***
## seller_typeTrustmark Dealer  8.294e-02  5.984e-02   1.386 0.165779    
## transmissionManual          -9.709e-02  1.692e-02  -5.737 1.02e-08 ***
## ownerFourth & Above Owner   -1.212e-01  2.880e-02  -4.209 2.61e-05 ***
## ownerSecond Owner           -7.404e-02  9.991e-03  -7.411 1.45e-13 ***
## ownerTest Drive Car          6.114e-01  1.317e-01   4.641 3.55e-06 ***
## ownerThird Owner            -1.087e-01  1.663e-02  -6.536 6.90e-11 ***
## fuel_efficiency              1.214e-02  1.956e-03   6.207 5.83e-10 ***
## engine                       2.566e-04  1.942e-05  13.213  < 2e-16 ***
## max_power                    8.500e-03  2.201e-04  38.616  < 2e-16 ***
## seats                        4.647e-02  6.311e-03   7.362 2.09e-13 ***
## make_categoryLuxury          5.779e-01  4.906e-02  11.778  < 2e-16 ***
## make_categoryMidrange        2.194e-01  3.336e-02   6.576 5.28e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2909 on 5272 degrees of freedom
## Multiple R-squared:  0.8474, Adjusted R-squared:  0.8469 
## F-statistic:  1830 on 16 and 5272 DF,  p-value: < 2.2e-16
#Number of parameters 
length(coef(bic_full_add_model))
## [1] 17
# Diagnostic plots
par(mfrow = c(1, 2))
plot(bic_full_add_model,which=c(1,2))

# Diagnostic tests
bp_test = bptest(bic_full_add_model)
print(bp_test)
## 
##  studentized Breusch-Pagan test
## 
## data:  bic_full_add_model
## BP = 372.41, df = 16, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(bic_full_add_model), size = 5000)
shapiro_test = shapiro.test(residual_subset)
print(shapiro_test)
## 
##  Shapiro-Wilk normality test
## 
## data:  residual_subset
## W = 0.98173, p-value < 2.2e-16
# Multicollinearity test
vif(bic_full_add_model)
##                      GVIF Df GVIF^(1/(2*Df))
## year             2.190700  1        1.480101
## km_driven_in_10k 1.634226  1        1.278369
## fuel             2.588377  1        1.608843
## seller_type      1.112922  2        1.027108
## transmission     1.446943  1        1.202889
## owner            1.412292  4        1.044096
## fuel_efficiency  3.683989  1        1.919372
## engine           5.759182  1        2.399830
## max_power        3.026818  1        1.739775
## seats            2.419890  1        1.555599
## make_category    1.562362  2        1.118009
Adjusted R^2 Number of parameters Breusch-Pagan test Shapiro-Wilk normality test
(with 5000 sample residuals)
VIF>5 Non significant predictors(pvalue > 0.05)
Full additive model with BIC
(Excluding “make” predictor)
0.8469423 17 Constant variance assumption
failed with BP = 372.41
Normality assumption f
ailed with W = 0.98173
None seller_typeTrustmark Dealer
  • The “full” linear model has a high \(R^2\) value of 0.8469423, indicating that 84.69% of the variance in selling_price_in_10k is explained by the predictors.

  • Most predictors are statistically significant (p-value < 0.05), suggesting they contribute meaningfully to the model. Except seller_typeTrustmark Dealer (p = 0.165779)

  • Diagnostic testing for this model:

    • The Breusch-Pagan test yielded a very high test statistic of 372.41 and a p-value less than \(2.2*10^{-16}\) which rejects the null hypothesis that constant variance exists in residuals

    • The Shapiro-Wilk test with 5000 sample residuals yielded a low p-value less than \(2.2*10^{-16}\), which rejects the null hypothesis that the residuals are normally distributed.

“Small model” with interaction terms

small_model = lm(selling_price_in_10k ~ make_category*year*max_power + year*transmission + owner + seller_type + km_driven_in_10k+fuel_efficiency, data = train_data)
summary(small_model)
## 
## Call:
## lm(formula = selling_price_in_10k ~ make_category * year * max_power + 
##     year * transmission + owner + seller_type + km_driven_in_10k + 
##     fuel_efficiency, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.56207 -0.19305 -0.00193  0.20786  1.57063 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           3.817e+01  8.581e+01   0.445 0.656439    
## make_categoryLuxury                  -3.231e+02  1.393e+02  -2.320 0.020379 *  
## make_categoryMidrange                -2.221e+02  8.540e+01  -2.601 0.009323 ** 
## year                                 -1.792e-02  4.268e-02  -0.420 0.674557    
## max_power                            -3.539e+00  1.410e+00  -2.509 0.012130 *  
## transmissionManual                    2.875e+00  1.071e+01   0.268 0.788462    
## ownerFourth & Above Owner            -1.059e-01  3.143e-02  -3.370 0.000756 ***
## ownerSecond Owner                    -7.255e-02  1.092e-02  -6.644 3.36e-11 ***
## ownerTest Drive Car                   3.003e-01  1.460e-01   2.057 0.039756 *  
## ownerThird Owner                     -1.128e-01  1.817e-02  -6.206 5.87e-10 ***
## seller_typeIndividual                -5.130e-02  1.539e-02  -3.334 0.000861 ***
## seller_typeTrustmark Dealer           1.391e-02  6.520e-02   0.213 0.831115    
## km_driven_in_10k                      1.176e-02  1.040e-03  11.307  < 2e-16 ***
## fuel_efficiency                      -7.519e-04  1.411e-03  -0.533 0.594237    
## make_categoryLuxury:year              1.615e-01  6.919e-02   2.335 0.019587 *  
## make_categoryMidrange:year            1.106e-01  4.247e-02   2.603 0.009257 ** 
## make_categoryLuxury:max_power         3.372e+00  1.516e+00   2.225 0.026152 *  
## make_categoryMidrange:max_power       2.756e+00  1.413e+00   1.951 0.051129 .  
## year:max_power                        1.766e-03  7.014e-04   2.518 0.011843 *  
## year:transmissionManual              -1.432e-03  5.316e-03  -0.269 0.787615    
## make_categoryLuxury:year:max_power   -1.680e-03  7.535e-04  -2.230 0.025805 *  
## make_categoryMidrange:year:max_power -1.371e-03  7.025e-04  -1.951 0.051063 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3174 on 5267 degrees of freedom
## Multiple R-squared:  0.8185, Adjusted R-squared:  0.8178 
## F-statistic:  1131 on 21 and 5267 DF,  p-value: < 2.2e-16
plot(small_model,which=c(1,2))

vif(small_model,type = "predictor")
## GVIFs computed for predictors
##                         GVIF Df GVIF^(1/(2*Df))
## make_category      44.883224 11        1.188761
## year                2.794605 13        1.040318
## max_power          44.883224 11        1.188761
## transmission     2853.840852  3        3.766214
## owner               1.465339  4        1.048920
## seller_type         1.123246  2        1.029482
## km_driven_in_10k    1.391267  1        1.179520
## fuel_efficiency     1.610527  1        1.269065
##                                          Interacts With
## make_category                           year, max_power
## year             make_category, max_power, transmission
## max_power                           make_category, year
## transmission                                       year
## owner                                              --  
## seller_type                                        --  
## km_driven_in_10k                                   --  
## fuel_efficiency                                    --  
##                                                                                              Other Predictors
## make_category                             transmission, owner, seller_type, km_driven_in_10k, fuel_efficiency
## year                                                    owner, seller_type, km_driven_in_10k, fuel_efficiency
## max_power                                 transmission, owner, seller_type, km_driven_in_10k, fuel_efficiency
## transmission                  make_category, max_power, owner, seller_type, km_driven_in_10k, fuel_efficiency
## owner            make_category, year, max_power, transmission, seller_type, km_driven_in_10k, fuel_efficiency
## seller_type            make_category, year, max_power, transmission, owner, km_driven_in_10k, fuel_efficiency
## km_driven_in_10k            make_category, year, max_power, transmission, owner, seller_type, fuel_efficiency
## fuel_efficiency            make_category, year, max_power, transmission, owner, seller_type, km_driven_in_10k

BIC-based stepwise model selection starting from “small model” (both directions)

n= nrow(train_data)
bic_small_model = step(small_model,direction="both",k=log(n),trace=0)
summary(bic_small_model)
## 
## Call:
## lm(formula = selling_price_in_10k ~ make_category + year + max_power + 
##     owner + km_driven_in_10k + make_category:max_power + year:max_power, 
##     data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5230 -0.1936  0.0028  0.2063  1.5677 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -1.783e+02  6.103e+00 -29.210  < 2e-16 ***
## make_categoryLuxury              2.052e+00  2.689e-01   7.633 2.71e-14 ***
## make_categoryMidrange            3.321e-01  2.054e-01   1.617 0.106016    
## year                             8.963e-02  3.031e-03  29.573  < 2e-16 ***
## max_power                       -7.988e-01  7.007e-02 -11.400  < 2e-16 ***
## ownerFourth & Above Owner       -1.139e-01  3.143e-02  -3.624 0.000293 ***
## ownerSecond Owner               -7.651e-02  1.089e-02  -7.028 2.35e-12 ***
## ownerTest Drive Car              3.162e-01  1.443e-01   2.191 0.028531 *  
## ownerThird Owner                -1.189e-01  1.812e-02  -6.562 5.83e-11 ***
## km_driven_in_10k                 1.132e-02  1.029e-03  11.008  < 2e-16 ***
## make_categoryLuxury:max_power   -8.803e-03  2.985e-03  -2.949 0.003202 ** 
## make_categoryMidrange:max_power -1.312e-03  2.857e-03  -0.459 0.646134    
## year:max_power                   4.035e-04  3.479e-05  11.596  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.318 on 5276 degrees of freedom
## Multiple R-squared:  0.8175, Adjusted R-squared:  0.8171 
## F-statistic:  1969 on 12 and 5276 DF,  p-value: < 2.2e-16
plot(bic_small_model,which=c(1,2))

vif(small_model,type = "predictor")
## GVIFs computed for predictors
##                         GVIF Df GVIF^(1/(2*Df))
## make_category      44.883224 11        1.188761
## year                2.794605 13        1.040318
## max_power          44.883224 11        1.188761
## transmission     2853.840852  3        3.766214
## owner               1.465339  4        1.048920
## seller_type         1.123246  2        1.029482
## km_driven_in_10k    1.391267  1        1.179520
## fuel_efficiency     1.610527  1        1.269065
##                                          Interacts With
## make_category                           year, max_power
## year             make_category, max_power, transmission
## max_power                           make_category, year
## transmission                                       year
## owner                                              --  
## seller_type                                        --  
## km_driven_in_10k                                   --  
## fuel_efficiency                                    --  
##                                                                                              Other Predictors
## make_category                             transmission, owner, seller_type, km_driven_in_10k, fuel_efficiency
## year                                                    owner, seller_type, km_driven_in_10k, fuel_efficiency
## max_power                                 transmission, owner, seller_type, km_driven_in_10k, fuel_efficiency
## transmission                  make_category, max_power, owner, seller_type, km_driven_in_10k, fuel_efficiency
## owner            make_category, year, max_power, transmission, seller_type, km_driven_in_10k, fuel_efficiency
## seller_type            make_category, year, max_power, transmission, owner, km_driven_in_10k, fuel_efficiency
## km_driven_in_10k            make_category, year, max_power, transmission, owner, seller_type, fuel_efficiency
## fuel_efficiency            make_category, year, max_power, transmission, owner, seller_type, km_driven_in_10k
bptest(bic_small_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  bic_small_model
## BP = 360.31, df = 12, p-value < 2.2e-16

ANOVA of both models created by BIC-based stepwise selection

Influential points

indicies_to_exclude = unname(which(cooks.distance(bic_small_model) > 4/length(cooks.distance(bic_small_model))))
sum(cooks.distance(bic_small_model) > 4/length(cooks.distance(bic_small_model)))
## [1] 287
train_data_filtered = train_data[-indicies_to_exclude,]

Refitting the BIC Small model without influential points

n= nrow(train_data_filtered)
refitted_wip = lm(log(selling_price_in_10k) ~ .+ make_category*year+ make_category*max_power -make-seats-engine, data = train_data_filtered)
                    #make_category*year*max_power + transmission + owner + seller_type + km_driven_in_10k+mileage, data = train_data_filtered)
summary(refitted_wip)
## 
## Call:
## lm(formula = log(selling_price_in_10k) ~ . + make_category * 
##     year + make_category * max_power - make - seats - engine, 
##     data = train_data_filtered)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.42853 -0.04443  0.00546  0.05217  0.27263 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -6.752e+01  1.232e+01  -5.479 4.50e-08 ***
## year                             3.410e-02  6.083e-03   5.605 2.20e-08 ***
## km_driven_in_10k                 1.259e-03  3.329e-04   3.781 0.000158 ***
## fuelPetrol                      -7.216e-02  2.914e-03 -24.765  < 2e-16 ***
## seller_typeIndividual           -1.566e-02  3.982e-03  -3.933 8.49e-05 ***
## seller_typeTrustmark Dealer      2.260e-02  1.664e-02   1.358 0.174520    
## transmissionManual              -8.703e-03  4.803e-03  -1.812 0.070044 .  
## ownerFourth & Above Owner       -3.849e-02  1.020e-02  -3.773 0.000163 ***
## ownerSecond Owner               -1.448e-02  2.860e-03  -5.065 4.24e-07 ***
## ownerTest Drive Car              6.860e-02  4.802e-02   1.428 0.153218    
## ownerThird Owner                -2.347e-02  4.952e-03  -4.739 2.21e-06 ***
## fuel_efficiency                 -3.005e-03  3.889e-04  -7.728 1.31e-14 ***
## max_power                        2.163e-03  1.563e-03   1.384 0.166432    
## make_categoryLuxury              9.183e+00  1.373e+01   0.669 0.503603    
## make_categoryMidrange           -8.729e+00  1.233e+01  -0.708 0.478942    
## year:make_categoryLuxury        -4.343e-03  6.784e-03  -0.640 0.522064    
## year:make_categoryMidrange       4.339e-03  6.085e-03   0.713 0.475808    
## max_power:make_categoryLuxury   -1.442e-03  1.589e-03  -0.907 0.364327    
## max_power:make_categoryMidrange  6.366e-04  1.563e-03   0.407 0.683796    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08083 on 4983 degrees of freedom
## Multiple R-squared:  0.8416, Adjusted R-squared:  0.841 
## F-statistic:  1471 on 18 and 4983 DF,  p-value: < 2.2e-16
bic_refitted_wip = step(refitted_wip,direction="both",k=log(n),trace=0)
summary(bic_refitted_wip)
## 
## Call:
## lm(formula = log(selling_price_in_10k) ~ year + km_driven_in_10k + 
##     fuel + seller_type + owner + fuel_efficiency + max_power + 
##     make_category + max_power:make_category, data = train_data_filtered)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.42846 -0.04473  0.00582  0.05266  0.27131 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -7.620e+01  8.944e-01 -85.201  < 2e-16 ***
## year                             3.838e-02  4.413e-04  86.959  < 2e-16 ***
## km_driven_in_10k                 1.225e-03  3.328e-04   3.680 0.000235 ***
## fuelPetrol                      -7.144e-02  2.898e-03 -24.652  < 2e-16 ***
## seller_typeIndividual           -1.521e-02  3.958e-03  -3.844 0.000123 ***
## seller_typeTrustmark Dealer      2.493e-02  1.664e-02   1.498 0.134131    
## ownerFourth & Above Owner       -3.911e-02  1.021e-02  -3.831 0.000129 ***
## ownerSecond Owner               -1.445e-02  2.862e-03  -5.050 4.57e-07 ***
## ownerTest Drive Car              4.230e-02  4.726e-02   0.895 0.370829    
## ownerThird Owner                -2.350e-02  4.956e-03  -4.743 2.17e-06 ***
## fuel_efficiency                 -3.006e-03  3.886e-04  -7.735 1.25e-14 ***
## max_power                        2.832e-03  1.280e-03   2.213 0.026965 *  
## make_categoryLuxury              4.893e-01  1.076e-01   4.549 5.52e-06 ***
## make_categoryMidrange            6.244e-02  9.167e-02   0.681 0.495801    
## max_power:make_categoryLuxury   -2.086e-03  1.312e-03  -1.590 0.111923    
## max_power:make_categoryMidrange -4.183e-06  1.280e-03  -0.003 0.997392    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0809 on 4986 degrees of freedom
## Multiple R-squared:  0.8412, Adjusted R-squared:  0.8407 
## F-statistic:  1761 on 15 and 4986 DF,  p-value: < 2.2e-16
vif(bic_refitted_wip,type = "predictor")
## GVIFs computed for predictors
##                      GVIF Df GVIF^(1/(2*Df)) Interacts With
## year             1.957323  1        1.399044           --  
## km_driven_in_10k 1.704781  1        1.305672           --  
## fuel             1.583004  1        1.258175           --  
## seller_type      1.092041  2        1.022256           --  
## owner            1.393603  4        1.042359           --  
## fuel_efficiency  1.718578  1        1.310945           --  
## max_power        1.841547  5        1.062963  make_category
## make_category    1.841547  5        1.062963      max_power
##                                                                                       Other Predictors
## year             km_driven_in_10k, fuel, seller_type, owner, fuel_efficiency, max_power, make_category
## km_driven_in_10k             year, fuel, seller_type, owner, fuel_efficiency, max_power, make_category
## fuel             year, km_driven_in_10k, seller_type, owner, fuel_efficiency, max_power, make_category
## seller_type             year, km_driven_in_10k, fuel, owner, fuel_efficiency, max_power, make_category
## owner             year, km_driven_in_10k, fuel, seller_type, fuel_efficiency, max_power, make_category
## fuel_efficiency             year, km_driven_in_10k, fuel, seller_type, owner, max_power, make_category
## max_power                            year, km_driven_in_10k, fuel, seller_type, owner, fuel_efficiency
## make_category                        year, km_driven_in_10k, fuel, seller_type, owner, fuel_efficiency
plot(bic_refitted_wip,which=c(1,2))

bptest(bic_refitted_wip)
## 
##  studentized Breusch-Pagan test
## 
## data:  bic_refitted_wip
## BP = 674.62, df = 15, p-value < 2.2e-16

Results

Discussion

Appendix